Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NIOAsyncWriter: Fix suspending yield when writer is finished #3044

Conversation

orobio
Copy link
Contributor

@orobio orobio commented Jan 9, 2025

This fixes hitting the following preconditionFailure in NIOAsyncWriter: "This should have already been handled by yield()".

It doesn't expect a yield to be suspended when the state is .writerFinished, but this can definitely happen. This seemed to be the correct solution to me, but please check it carefully, because I'm unfamiliar with this code.

I'm not happy about the test for this. It's blocking the didYield call for a while, to make sure the correct state is reached. See also the 'FIXME' line. What would be a better way to do this?

Copy link
Contributor

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great catch, thanks so much!

Regarding your test, it might be useful for the test to use a new delegate that makes it easier to arrange the ordering to work out. NIO's ConditionLock can be used to set up a nice ordering chain, so that instead of using the usleeps we can just block on that lock being unlocked with the right value.

@orobio orobio force-pushed the NIOAsyncWriter-Fix-suspending-yield-when-writer-finished branch from bf18517 to f15a87b Compare January 14, 2025 08:05
@orobio
Copy link
Contributor Author

orobio commented Jan 14, 2025

@Lukasa : I updated it with ConditionLock. Is this what you had in mind?

This is still blocking a thread from the concurrency pool and then depending on another thread to unblock it. Will that not be a problem? The only way I see around that is to add an async hook to NIOAsyncWriter.

@Lukasa
Copy link
Contributor

Lukasa commented Jan 14, 2025

For testing purposes, that blocking will be safe. We will eventually resolve it, and we don't rely on Dispatch to make that happen (it requires the NIO thread to do it), so it'll be ok.

@Lukasa Lukasa added the 🔨 semver/patch No public API change. label Jan 14, 2025
Copy link
Contributor

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really lovely patch, thanks so much @orobio!

@Lukasa Lukasa enabled auto-merge (squash) January 14, 2025 15:12
@Lukasa Lukasa merged commit 6fb31ea into apple:main Jan 14, 2025
34 of 35 checks passed
@orobio
Copy link
Contributor Author

orobio commented Jan 14, 2025

For testing purposes, that blocking will be safe. We will eventually resolve it, and we don't rely on Dispatch to make that happen (it requires the NIO thread to do it), so it'll be ok.

I'm not sure I fully understand what you're saying here, but I assume it's fine then.

I do still have that 'FIXME' line in there, which would probably be good to remove. Well..., this got merged while I was typing, so I'll open a new pull request for that.

Edit: Opened #3059

orobio added a commit to orobio/swift-nio that referenced this pull request Jan 14, 2025
This was left in when pull request apple#3044 was merged, but as discussed on
that pull request, this doesn't need to be fixed.
Lukasa pushed a commit that referenced this pull request Jan 14, 2025
This was left in when pull request #3044 was merged, but as discussed on
that pull request, this doesn't need to be fixed.
@finagolfin
Copy link
Contributor

It appears this pull gets the NIO tests to hang in the Android x86_64 emulator, but weirdly only when compiled and run on macOS, ie the linux CI runs shown there work fine. I don't know if that's a macOS cross-compilation issue or some incompatibility with the macOS emulator triggered by this pull, possibly both. Reverting this pull gets my Android CI runs on macOS to work again.

Let me know what you guys think.

@Lukasa
Copy link
Contributor

Lukasa commented Jan 16, 2025

That’s very interesting. Do you have a smaller repro? The OS-specificity makes me think this is probably a Swift bug, but if it’s easy enough to reproduce we may be able to figure out if the test that hangs is just flaky.

@finagolfin
Copy link
Contributor

A bit tough for me to reproduce since I don't use macOS. @marcprux, any interest in looking into this on your Mac?

@marcprux
Copy link

marcprux commented Jan 16, 2025

When I test against a local Android emulator (with skip android test) on my ARM macOS 15 machine with your released 6.0.3 Android SDK, the NIOAsyncWriterTests tests (which appears to be where your CI is hanging) do all pass:

Test Suite 'NIOAsyncWriterTests' passed at 2025-01-16 09:23:46.416
	 Executed 33 tests, with 0 failures (0 unexpected) in 0.022 (0.022) seconds

However, the test run does ultimately fail due to a leaked promise.

Edit: see next comment.

I also set up an Android CI in a fork of swift-nio and both macos-13 and ubuntu-24.04. In the run, macOS hangs like you are seeing and Ubuntu fails with the same leaked promise fatalError. I tried both with and without applying your patches in the workflow.

I'm not sure why the CI's macOS testing hangs but my local tests complete, but one difference is that the CI's Android emulator is API 29 and mine is API 34. I'll try with some different Android API levels and see if that changes anything.

Another possibility is architecture: my local emulator is running on aarch64, but the CI runs x86_64 (which is why we are running on macos-13: it is the last public macOS image to support x86). I won't have a macOS x86 machine handy, but I can get access to one later on, so I can try that out if you suspect that it could be architecture related.

Here's the leaked promise, although it is likely a red herring:

Test Case 'ChannelTests.testAskForLocalAndRemoteAddressesAfterChannelIsClosed' started at 2025-01-16 09:23:53.419
NIOPosixTests/TestUtils.swift:697: Fatal error: leaking promise created at (file: "NIOPosixTests/TestUtils.swift", line: 697)
Current stack trace:
0    libswiftCore.so                    0x00000075619ad878 _swift_stdlib_reportFatalErrorInFile + 128
1    libswiftCore.so                    0x00000075617cfb64 _assertionFailure(_:_:file:line:flags:) + 244
2    swift-nioPackageTests.xctest       0x00000061dced52f0 <unavailable> + 19673840
3    swift-nioPackageTests.xctest       0x00000061dcf77a40 <unavailable> + 20339264
4    swift-nioPackageTests.xctest       0x00000061dcf77e34 <unavailable> + 20340276
5    swift-nioPackageTests.xctest       0x00000061dcee49ec <unavailable> + 19737068
6    swift-nioPackageTests.xctest       0x00000061dcee4aac <unavailable> + 19737260
7    libswiftCore.so                    0x000000756191a79c <unavailable> + 4884380
8    libswiftCore.so                    0x000000756191b274 <unavailable> + 4887156
9    swift-nioPackageTests.xctest       0x00000061dda7ca48 <unavailable> + 31894088
10   swift-nioPackageTests.xctest       0x00000061dda705b0 <unavailable> + 31843760
11   swift-nioPackageTests.xctest       0x00000061dda7334c <unavailable> + 31855436
12   swift-nioPackageTests.xctest       0x00000061dda754fc <unavailable> + 31864060
13   swift-nioPackageTests.xctest       0x00000061dda7cf88 <unavailable> + 31895432
14   swift-nioPackageTests.xctest       0x00000061dda7bab4 <unavailable> + 31890100
15   swift-nioPackageTests.xctest       0x00000061dda7b4c8 <unavailable> + 31888584
16   swift-nioPackageTests.xctest       0x00000061dd7611d4 <unavailable> + 28635604
17   swift-nioPackageTests.xctest       0x00000061dd76e1b8 <unavailable> + 28688824
18   libXCTest.so                       0x000000755b6ccbcc <unavailable> + 236492
19   libXCTest.so                       0x000000755b6ce564 XCTAssertNoThrow<A>(_:_:file:line:) + 68
20   swift-nioPackageTests.xctest       0x00000061dd7907cc <unavailable> + 28829644
21   swift-nioPackageTests.xctest       0x00000061ddd54300 <unavailable> + 34874112
22   swift-nioPackageTests.xctest       0x00000061dce81c6c <unavailable> + 19332204
23   swift-nioPackageTests.xctest       0x00000061ddd829ac <unavailable> + 35064236
24   libXCTest.so                       0x000000755b6c5c44 <unavailable> + 207940
25   libXCTest.so                       0x000000755b6c50bc <unavailable> + 204988
26   libXCTest.so                       0x000000755b6c5bac <unavailable> + 207788
27   libXCTest.so                       0x000000755b6c5b8c <unavailable> + 207756
28   libXCTest.so                       0x000000755b6c5f00 <unavailable> + 208640
29   libXCTest.so                       0x000000755b6c405c <unavailable> + 200796
30   libXCTest.so                       0x000000755b6c6164 XCTestCase.invokeTest() + 356
31   libXCTest.so                       0x000000755b6c5f48 XCTestCase.perform(_:) + 128
32   libXCTest.so                       0x000000755b6c8a54 XCTest.run() + 164
33   libXCTest.so                       0x000000755b6c706c XCTestSuite.perform(_:) + 164
34   libXCTest.so                       0x000000755b6c8a54 XCTest.run() + 164
35   libXCTest.so                       0x000000755b6c706c XCTestSuite.perform(_:) + 164
36   libXCTest.so                       0x000000755b6c8a54 XCTest.run() + 164
37   libXCTest.so                       0x000000755b6c706c XCTestSuite.perform(_:) + 164
38   libXCTest.so                       0x000000755b6c8a54 XCTest.run() + 164
39   libXCTest.so                       0x000000755b6c40ec XCTMain(_:arguments:observers:) + 312
40   libXCTest.so                       0x000000755b6c42c4 XCTMain(_:) + 40
41   swift-nioPackageTests.xctest       0x00000061ddddf170 <unavailable> + 35443056
42   swift-nioPackageTests.xctest       0x00000061ddddf190 <unavailable> + 35443088
43   swift-nioPackageTests.xctest       0x00000061ddddf1a8 <unavailable> + 35443112
44   libc.so                            0x000000755d8f25a8 __libc_init + 108
Trap 

@marcprux
Copy link

Actually, scratch all that. It seems to be hanging randomly, on both macOS and Ubuntu runners, irrespective of API level. See the runs at https://github.com/marcprux/swift-nio/actions/runs/12812276673

I'm still not able to get it to hang on my local machine, but I'll keep trying.

@Lukasa
Copy link
Contributor

Lukasa commented Jan 16, 2025

Does it always hang on the same test?

@marcprux
Copy link

No, it seems to vary. Sometimes in NIOAsyncSequenceProducerTests.testYield_whenStreaming_andNotSuspended_andDemandMore , sometimes in NIOAsyncWriterTests.testSetWritability_whenInitial. It's hard to tell exactly because I think there's some output buffering going on (because sometimes the log will just stop at an incomplete line like:

Test Case 'NIOAsyncSequenceProducerTests.testYield_whenS
Error: The operation was canceled.

@Lukasa
Copy link
Contributor

Lukasa commented Jan 17, 2025

Yeah, that seems fairly likely. Can you run the underlying test binary directly? Potentially running it under stdbuf -o0 to try to disable output buffering entirely.

@finagolfin
Copy link
Contributor

On my Android CI, it is entirely consistent: linux CI runs pass every time while macOS runs always hang. See the run yesterday: across two attempts with this pull, I had four macOS hangs and six linux runs pass (ignore the devel macOS failures, that's a separate issue). So much so that today I only reverted this pull for macOS, finagolfin/swift-android-sdk@7476f2be5, and all the runs passed.

@Lukasa
Copy link
Contributor

Lukasa commented Jan 17, 2025

@finagolfin What happens if you only revert the test?

@finagolfin
Copy link
Contributor

I don't think this test is the issue, as it appears to be failing elsewhere in the log, but I will check that and let you know.

@finagolfin
Copy link
Contributor

Nope, you were right to suspect the test, as simply reverting this test on macOS got the NIO tests to pass again. 😄

Marc, can you confirm?

marcprux added a commit to marcprux/swift-nio that referenced this pull request Jan 18, 2025
@marcprux
Copy link

Marc, can you confirm?

Confirmed: if I remove the test, then the hangs never occurs in the emulator (running on macOS-13 or Ubuntu).

@finagolfin
Copy link
Contributor

@marcprux, are you disabling all NIOAsyncWriterTests though, as shown in the linked commit above, marcprux/swift-nio@87d02310d? If so, it would be better to just disable this new test, as I did above, to see if this code change affects the older tests too, which you are no longer running.

I'm going to try running the linux-compiled tests on the macOS emulator and vice versa next, to see if that helps narrow down where this strange hang is coming from, as it could be that we're hitting a bug in the Android emulator for macOS.

@Lukasa
Copy link
Contributor

Lukasa commented Jan 20, 2025

This could be an emulator bug, but it could also be that we're seeing an ordering issue flushed out by the performance overhead of the emulator. It'd be useful to try to get a "sequence of events" log from your hanging test as well, as concurrent tests are horrible to debug.

@finagolfin
Copy link
Contributor

finagolfin commented Jan 24, 2025

Alright, I ran the NIO tests alone in the Android emulator, once swapping the mac-compiled tests into the linux-hosted emulator and vice versa, once keeping the mac-built tests in the mac-hosted emulator and so on, where only the mac version hung.

Looking at the results from the swapped run, the macos-hosted emulator again hung each time, with the linux-compiled tests not finishing for 20 minutes before I killed it (I have a 5 macOS job limit on the OSS free tier of github Actions, so I killed it once it seemed to hang, so other jobs could proceed). I just kicked off a swapped run again that I will let run for longer to make sure it does hang after longer (just killed it again after the macos-hosted emulators all hung for more than an hour and a half).

This means the problem is in the interaction between this test and the macOS-hosted Android emulator, since the linux-compiled test runs fine in the linux-hosted emulator but hangs in the mac-hosted emulator. The mac cross-compile and mac-hosted Android emulator are consistently slower on github Actions, because of greater contention for fewer macOS CI machines or whatever, so either that lack of speed screws up this test or there is a bug in the macOS-hosted Android emulator.

Regarding "a 'sequence of events' log," you want me to spray this test with print statements to figure out what's running when?

@Lukasa
Copy link
Contributor

Lukasa commented Jan 27, 2025

Yeah, that was what I was asking for. Presumably this is the result of a specific ordering sequence caused by the slowdown in the emulator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🔨 semver/patch No public API change.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants